A number of researchers have suggested that one of the limitations of LDA is that it cannot identify how many topics are in a corpus, leaving this decision to the human user (Yau et al., 2014; Suominen and Toivanen, 2016). Indeed, there is no way to identify the "correct" number of topics in advance of building the topic model (Carter et al., 2016). If the user specifies too few topics, the topics will be too general to be useful for exploratory analysis or information retrieval. By contrast, if the user specifies too many topics, the topics will be too specific, or redundant, to be of use; too many topics also make the interpretation of the model unwieldy. Therefore, most users experiment with the number of topics and make qualitative evaluations about which number of topics is most useful (Chang et al., 2016). Ultimately, the right choice about the number of topics depends on the way in which the model is going to be used (Carter et al., 2016). As such, the ratio of documents (n) in a corpus to topics (k) extracted from that corpus varies widely from study to study.
Here, I analyze the properties of three topic models, each of which has a different number of topics:
model_25_topics: This model has 25 topics, is based on a noun-only corpus, and has the alpha value set to symmetric.
model_75_topics: This model has 75 topics, is based on a noun-only corpus, and has the alpha value set to symmetric.
model_150_topics: This model has 150 topics, is based on a noun-only corpus, and has the alpha value set to symmetric.
from gensim import corpora, models, similarities
import pyLDAvis.gensim
import json
import spacy
path = '../noun_corpus/'
# load metadata for later use
with open('../data/doc2metadata.json', encoding='utf8', mode='r') as f:
    doc2metadata = json.load(f)
# load dictionary and corpus for the noun models
dictionary = corpora.Dictionary.load(path + 'noun_corpus.dict')
corpus = corpora.MmCorpus(path + 'noun_corpus.mm')
# load model_25_topics
model_25_topics = models.ldamodel.LdaModel.load(path + 'noun_25.model')
# load model_75_topics
model_75_topics = models.ldamodel.LdaModel.load(path + 'noun_75.model')
# load model_150_topics
model_150_topics = models.ldamodel.LdaModel.load(path + 'noun_150.model')
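The three models are loaded from disk rather than trained here. For reference, the following is a minimal sketch of how a comparable model could be trained with gensim; num_topics and the symmetric alpha follow the model descriptions above, while the passes and random_state values are illustrative assumptions rather than the settings used for the saved models.
# Illustrative sketch only: train a model comparable to model_25_topics.
# passes and random_state are assumed values, not the original training settings.
example_model = models.ldamodel.LdaModel(
    corpus=corpus,
    id2word=dictionary,
    num_topics=25,
    alpha='symmetric',
    passes=10,
    random_state=42)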
model_25_topics
model_25_viz = pyLDAvis.gensim.prepare(model_25_topics, corpus, dictionary)
pyLDAvis.display(model_25_viz)
model_25_topics produced 4 topics which lack semantic or contextual coherence, 2 topics of mixed coherence, and 19 topics which are coherent. Therefore its topics are:
To illustrate what is meant by each category, consider the following examples:
Examples of junk topics:
Example of mixed topic:
Examples of coherent topics:
model_75_topics
model_75_viz = pyLDAvis.gensim.prepare(model_75_topics, corpus, dictionary)
pyLDAvis.display(model_75_viz)
model_75_topics produced 13 topics which lack semantic or contextual coherence, 5 topics of mixed coherence, and 57 topics which are coherent. Therefore its topics are:
A number of the topics from model_25_topics reappear in model_75_topics. However, some topics, such as topic 9 from model_25_topics, appear to be given more nuance in model_75_topics, for example:
model_75_topics also introduces many new coherent topics not found in model_25_topics, for example:
model_150_topics
model_150_viz = pyLDAvis.gensim.prepare(model_150_topics, corpus, dictionary)
pyLDAvis.display(model_150_viz)
model_150_topics produced 33 topics which lack semantic or contextual coherence, 8 topics of mixed coherence, and 109 topics which are coherent. Therefore its topics are:
The coherent topics found in the previous models are present in model_150_topics, but a large number of other coherent topics are added, for example:
model_25_topics contained 76% coherent topics, model_75_topics contained 77% coherent topics, and model_150_topics contained 73% coherent topics. So, relative to the number of topics in each model, the performance was similar. However, given the raw numbers, model_150_topics contains far more coherent topics than either of the other two models. This suggests that model_150_topics provides a more nuanced model of the corpus. Topics which did not register in the other models, such as topic 78 (holiness code) and topic 112 (patristics), are revealed in model_150_topics. The utility of having nuanced topics needs to be weighed against the difficulty of keeping track of so many topics while doing an exploratory analysis of a corpus; nuance comes at the cost of efficiency.
To compare how the models behave when clustering documents, the following function counts how many documents in the corpus are assigned to a single topic, to multiple topics, or to no topic at all, using a 20% minimum probability threshold.
def cluster_test(corpus, model):
    # count how many documents are assigned to one, several, or no topics
    # at a 20% minimum probability threshold
    docs_with_1_topic = 0
    docs_with_multiple_topics = 0
    docs_with_no_topics = 0
    total_docs = 0
    for doc in corpus:
        topics = model.get_document_topics(doc, minimum_probability=0.20)
        total_docs += 1
        if len(topics) == 1:
            docs_with_1_topic += 1
        elif len(topics) > 1:
            docs_with_multiple_topics += 1
        else:
            docs_with_no_topics += 1
    print('Corpus assigned to a single topic:', (docs_with_1_topic / total_docs) * 100, '%')
    print('Corpus assigned to multiple topics:', (docs_with_multiple_topics / total_docs) * 100, '%')
    print('Corpus assigned to no topics:', (docs_with_no_topics / total_docs) * 100, '%')
model_25_topics
cluster_test(corpus, model_25_topics)
model_75_topics
cluster_test(corpus, model_75_topics)
model_150_topics
cluster_test(corpus, model_150_topics)
model_25_topics outperforms the other two models in that it left only 1.47% of documents unassigned to a topic. By contrast, model_75_topics left 14.59% of documents unassigned and model_150_topics left 28.14% of documents unassigned. Additionally, although model_25_topics assigned fewer documents to a single topic than the other two models, it assigned far more documents to multiple topics, thus providing a more robust clustering in which a document may belong to more than one topic.
# build indices for similarity queries
index_25 = similarities.MatrixSimilarity(model_25_topics[corpus])
index_75 = similarities.MatrixSimilarity(model_75_topics[corpus])
index_150 = similarities.MatrixSimilarity(model_150_topics[corpus])
# define retrieval test
def retrieval_test(new_doc, lda, index):
    new_bow = dictionary.doc2bow(new_doc)  # change new document to bag of words representation
    new_vec = lda[new_bow]  # change new bag of words to a vector
    index.num_best = 10  # set index to generate 10 best results
    matches = index[new_vec]
    scores = []
    for match in matches:
        score = match[1]
        scores.append(score)
        score = str(score)
        key = 'doc_' + str(match[0])
        article_dict = doc2metadata[key]
        author = article_dict['author']
        title = article_dict['title']
        year = article_dict['pub_year']
        print(key + ': ' + author.title() + ' (' + year + '). ' + title.title() + '\n\tsimilarity score -> ' + score + '\n')
    average_score = sum(scores) / len(scores)
    print('*********************************')
    print("Average similarity score ->", average_score)
# set up nlp for new docs
nlp = spacy.load('en')
stop_words = spacy.en.STOPWORDS
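The setup above reflects an older spaCy release. On spaCy 2.x or later, a roughly equivalent setup would look like the sketch below; the en_core_web_sm model name is an assumption about which English pipeline is installed.
# alternative setup for newer spaCy releases (assumes en_core_web_sm is installed)
nlp = spacy.load('en_core_web_sm')
stop_words = nlp.Defaults.stop_words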
def get_noun_lemmas(text):
    doc = nlp(text)
    tokens = [token for token in doc]
    noun_tokens = [token for token in tokens if token.tag_ == 'NN' or token.tag_ == 'NNP' or token.tag_ == 'NNS']
    noun_lemmas = [noun_token.lemma_ for noun_token in noun_tokens if noun_token.is_alpha]
    noun_lemmas = [noun_lemma for noun_lemma in noun_lemmas if noun_lemma not in stop_words]
    return noun_lemmas
# load and process Greene, N. E. (2017)
with open('../abstracts/greene.txt', encoding='utf8', mode='r') as f:
    text = f.read()
greene = get_noun_lemmas(text)
# load and process Hollenback, G. M. (2017)
with open('../abstracts/hollenback.txt', encoding='utf8', mode='r') as f:
    text = f.read()
hollenback = get_noun_lemmas(text)
# load and process Dinkler, M. B. (2017)
with open('../abstracts/dinkler.txt', encoding='utf8', mode='r') as f:
    text = f.read()
dinkler = get_noun_lemmas(text)
model_25_topics
retrieval_test(greene, model_25_topics, index_25)
model_75_topics
retrieval_test(greene, model_75_topics, index_75)
model_150_topics
retrieval_test(greene, model_150_topics, index_150)
The average similarity scores for the top ten results of the first information retrieval task are as follows:
model_25_topics: average similarity score -> 91.8%
model_75_topics: average similarity score -> 87.7%
model_150_topics: average similarity score -> 76.1%
Two documents from the corpus were matched with the Greene article in all three models:
doc_2855 shows up as the 3rd highest match in model_25_topics (similarity score of 92.9%) and model_150_topics (similarity score of 78.2%), but as the 7th highest match in model_75_topics (similarity score of 87.1%). doc_8205 shows up as the 7th highest match in model_25_topics (similarity score of 90.4%) and model_75_topics (similarity score of 87.1%), but as the 10th highest match in model_150_topics (similarity score of 72.2%).
model_25_topics
retrieval_test(hollenback, model_25_topics, index_25)
model_75_topics
retrieval_test(hollenback, model_75_topics, index_75)
model_150_topics
retrieval_test(hollenback, model_150_topics, index_150)
The average similarity scores for the top ten results of the second information retrieval task are as follows:
model_25_topics: average similarity score -> 82.2%
model_75_topics: average similarity score -> 70.8%
model_150_topics: average similarity score -> 56.5%
model_25_topics returned results having to do with biblical law and rabbinic interpretation, with an average similarity score of 82.2%. model_75_topics returned results that focus primarily on issues of gender and sexuality, with an average similarity score of 70.8%. Finally, model_150_topics returned results focusing on translation issues, with an average similarity score of 56.5%. Clearly, each model understands this article differently. All three themes (law, gender/sexuality, and translation issues) are present in the article, so in a sense each model is useful. However, and interestingly, none of the models returned the article to which the present one is a response: Walsh, J. T. (2001). Leviticus 18:22 and 20:13: Who is Doing What to Whom? Journal of Biblical Literature, 120, 201-9.
model_25_topics
retrieval_test(dinkler, model_25_topics, index_25)
model_75_topics
retrieval_test(dinkler, model_75_topics, index_75)
model_150_topics
retrieval_test(dinkler, model_150_topics, index_150)
The average similarity scores for the top ten results of the third information retrieval task are as follows:
model_25_topics: average similarity score -> 97.7%
model_75_topics: average similarity score -> 83.2%
model_150_topics: average similarity score -> 78.0%
Each topic model retrieved documents dealing with the gospels, which, on a general level, is fitting for this article. One document from the corpus was retrieved by all three models:
model_25_topics ranked this document as the 8th highest match (similarity score of 97.4%), whereas both model_75_topics and model_150_topics ranked it as the top match (similarity scores of 86.7% and 80.0% respectively). It may seem strange that these two models ranked this document as the best match, since it is about the Gospel of John while the query article is about the Gospel of Luke; however, the added nuance of these models appears to be picking up the shared theme of literary characterization.
The number of topics assigned to a topic model has a significant effect on the properties of that model. On the one hand, assigning more topics allows the model to group patterns of word co-occurrence in more nuanced ways, which yields more specific, but possibly redundant, topics. On the other hand, since each document is modeled as a probability distribution over all of the topics (however low some of those probabilities may be), and since those probabilities must sum to 1.0, it follows that the more topics are assigned, the lower, on average, those probabilities will be; for example, with 25 topics the average probability per topic in a given document is 1/25, or 4%, whereas with 150 topics it is roughly 0.7%. How does this affect topic coherence, clustering, and information retrieval?
The three models examined here each produced approximately the same percentage of junk topics, mixed topics, and coherent topics. However, in terms of raw numbers, model_150_topics produced far more coherent topics than did the other models. The coherent topics from model_150_topics are also more specific than those provided by the other models. This larger number of coherent topics, and the greater specificity, comes at a cost: more junk topics, potential redundancy among topics, and the difficulty of keeping track of so many topics (to say nothing of the additional computational time). So while a larger number of topics produces a larger number of coherent topics, and is therefore useful for exploratory analysis of a large corpus, a researcher may settle on a middle ground between a small number of topics and a large number of topics as a way of balancing cost and benefit.
model_25_topics was able to cluster nearly the entire corpus of the JBL, whereas the other two models fell far short of that goal. Of course, such clustering is only useful if the topics are coherent. Assigning a smaller number of topics to a model allows the model to cluster more documents with a topic at a higher threshold (like the 20% threshold used here) because the values in the probability distribution will be higher. Therefore, a smaller number of topics is desirable if the goal of the topic model is to cluster documents.
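To see this effect concretely, the sketch below (illustrative, not part of the original analysis) takes the first document in the corpus and counts how many of its topic probabilities clear the 20% threshold used by cluster_test; the choice of document is arbitrary.
# Illustrative sketch: compare how one document's topic probabilities spread
# out as the number of topics grows. The choice of the first document is arbitrary.
doc = next(iter(corpus))
for name, model in [('model_25_topics', model_25_topics),
                    ('model_75_topics', model_75_topics),
                    ('model_150_topics', model_150_topics)]:
    # minimum_probability=0 returns (effectively) the full distribution, which sums to ~1.0
    distribution = model.get_document_topics(doc, minimum_probability=0)
    above_threshold = [prob for _, prob in distribution if prob >= 0.20]
    print(name, '-> topics at or above the 20% threshold:', len(above_threshold))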
As far as information retrieval is concerned, the smaller the number of topics assigned to the model, the higher the similarity scores for matching documents will be; but this does not necessarily mean that model_25_topics retrieved more relevant articles than model_150_topics did. Rather, since the topics of model_25_topics are less specific, it may return only general matches, albeit at a higher similarity score. The topics from model_150_topics are more specific and will therefore return matches at a lower similarity score. In other words, model_25_topics may fail to return matches which would be useful to the researcher. model_75_topics may be a nice compromise between general matches at a higher similarity score and specific matches at a lower similarity score. On the whole, model_75_topics achieved higher similarity scores than model_150_topics did, but it also offers more nuance than model_25_topics by virtue of having more topics.